91 research outputs found

    Multivariate Prediction Models for Bio-Analytical Data

    Quantitative bio-analytical techniques that enable parallel measurements of large numbers of biomolecules generate vast amounts of information for studying and characterising biological systems. These analytical methods are commonly referred to as omics technologies, and can be applied for measurements of e.g. mRNA transcript, protein or metabolite abundances in a biological sample. The work presented in this thesis focuses on the application of multivariate prediction models for modelling and analysis of biological data generated by omics technologies. Omics data commonly contain up to tens of thousands of variables, which are often both noisy and multicollinear. Multivariate statistical methods have previously been shown to be valuable for visualisation and predictive modelling of biological and chemical data with similar properties to omics data. In this thesis, currently available multivariate modelling methods are used in new applications, and new methods are developed to address some of the specific challenges associated with modelling of biological data. Three closely related areas of multivariate modelling of biological data are described and demonstrated in this thesis. First, a multivariate projection method is used in a novel application for predictive modelling between omics data sets, demonstrating how data from two analytical sources can be integrated and modelled together by exploring covariation patterns between the data sets. This approach is exemplified by modelling of data from two studies, the first containing proteomic and metabolic profiling data and the second containing transcriptomic and metabolic profiling data. Second, a method for piecewise multivariate modelling of short time-series data is developed and demonstrated by modelling of simulated data as well as metabolic profiling data from a toxicity study, providing a new method for characterisation of multivariate bio-analytical time-series data. Third, a kernel-based method is developed and applied for non-linear multivariate prediction modelling of omics data, addressing the specific challenge of modelling non-linear variation in biological data.

    Robust Linear Models for Cis-eQTL Analysis

    Expression Quantitative Trait Loci (eQTL) analysis enables characterisation of functional genetic variation influencing expression levels of individual genes. In outbred populations, including humans, eQTLs are commonly analysed using the conventional linear model, adjusting for relevant covariates, assuming an allelic dosage model and a Gaussian error term. However, gene expression data generally have noise that induces heavy-tailed errors relative to the Gaussian distribution and often include atypical observations, or outliers. Such departures from modelling assumptions can lead to an increased rate of type II errors (false negatives), and to some extent also type I errors (false positives). Careful model checking can reduce the risk of type I errors but often not type II errors, since it is generally too time-consuming to carefully check all models with a non-significant effect in large-scale and genome-wide studies. Here we propose the application of a robust linear model for eQTL analysis to reduce adverse effects of deviations from the assumption of Gaussian residuals. We present results from a simulation study as well as results from the analysis of real eQTL data sets. Our findings suggest that in many situations robust models have the potential to provide more reliable eQTL results compared to conventional linear models, particularly with respect to reducing type II errors due to non-Gaussian noise. Post-genomic data, such as those generated in genome-wide eQTL studies, are often noisy and frequently contain atypical observations. Robust statistical models have the potential to provide more reliable results and increased statistical power under non-Gaussian conditions. The results presented here suggest that robust models should be considered routinely alongside other commonly used methodologies for eQTL analysis.

    Study design requirements for RNA sequencing-based breast cancer diagnostics

    Sequencing-based molecular characterization of tumors provides information required for individualized cancer treatment. There are well-defined molecular subtypes of breast cancer that provide improved prognostication compared to routine biomarkers. However, molecular subtyping is not yet implemented in routine breast cancer care. Clinical translation is dependent on subtype prediction models providing high sensitivity and specificity. In this study we evaluate sample size and RNA-sequencing read requirements for breast cancer subtyping to facilitate rational design of translational studies. We applied subsampling to ascertain the effect of training sample size and the number of RNA sequencing reads on classification accuracy of molecular subtype and routine biomarker prediction models (unsupervised and supervised). Subtype classification accuracy improved with increasing sample size up to N = 750 (accuracy = 0.93), although with a modest improvement beyond N = 350 (accuracy = 0.92). Prediction of routine biomarkers achieved accuracy of 0.94 (ER) and 0.92 (HER2) at N = 200. Subtype classification improved with RNA-sequencing library size up to 5 million reads. Development of molecular subtyping models for cancer diagnostics requires well-designed studies. Sample size and the number of RNA sequencing reads directly influence accuracy of molecular subtyping. Results in this study provide key information for rational design of translational studies aiming to bring sequencing-based diagnostics to the clinic.

    K-OPLS package: Kernel-based orthogonal projections to latent structures for prediction and interpretation in feature space

    Background: Kernel-based classification and regression methods have been successfully applied to modelling a wide variety of biological data. The Kernel-based Orthogonal Projections to Latent Structures (K-OPLS) method offers unique properties facilitating separate modelling of predictive variation and structured noise in the feature space. While providing prediction results similar to other kernel-based methods, K-OPLS features enhanced interpretational capabilities, allowing detection of unanticipated systematic variation in the data such as instrumental drift, batch variability or unexpected biological variation. Results: We demonstrate an implementation of the K-OPLS algorithm for MATLAB and R, licensed under the GNU GPL and available at http://www.sourceforge.net/projects/kopls/. The package includes essential functionality and documentation for model evaluation (using cross-validation), training and prediction of future samples. Also included is a set of diagnostic tools and plot functions to simplify the visualisation of data, e.g. for detecting trends or for identification of outlying samples. The utility of the software package is demonstrated by means of a metabolic profiling data set from a biological study of hybrid aspen. Conclusion: The properties of the K-OPLS method are well suited for analysis of biological data, which in conjunction with the availability of the outlined open-source package provides a comprehensive solution for kernel-based analysis in bioinformatics applications.

    Determining breast cancer histological grade from RNA-sequencing data

    BACKGROUND: The histologic grade (HG) of breast cancer is an established prognostic factor. The grade is usually reported on a scale ranging from 1 to 3, where grade 3 tumours are the most aggressive. However, grade 2 is associated with an intermediate risk of recurrence, and carries limited information for clinical decision-making. Patients classified as grade 2 are at risk of both under- and over-treatment. METHODS: RNA-sequencing analysis was conducted in a cohort of 275 women diagnosed with invasive breast cancer. Multivariate prediction models were developed to classify tumours into high and low transcriptomic grade (TG) based on gene- and isoform-level expression data from RNA-sequencing. HG2 tumours were reclassified according to the prediction model and a recurrence-free survival analysis was performed using a multivariate Cox proportional hazards regression model to assess to what extent the TG model could be used to stratify patients. The prediction model was validated in N = 487 breast cancer cases from The Cancer Genome Atlas (TCGA) data set. Differentially expressed genes and isoforms associated with HGs were analysed using linear models. RESULTS: The classification of grade 1 and grade 3 tumours based on RNA-sequencing data achieved high accuracy (area under the receiver operating characteristic curve = 0.97). The association between recurrence-free survival rate and HGs was confirmed in the study population (hazard ratio of grade 3 versus 1 was 2.62 with 95% confidence interval = 1.04-6.61). The TG model enabled us to reclassify grade 2 tumours as high or low TG at the gene or isoform level. The risk of recurrence in the high TG group of grade 2 tumours was higher than in the low TG group (hazard ratio = 2.43, 95% confidence interval = 1.13-5.20). We found 8200 genes and 13,809 isoforms that were differentially expressed between HG1 and HG3 breast cancer tumours. CONCLUSIONS: Gene- and isoform-level expression data from RNA-sequencing could be utilised to differentiate HG1 and HG3 tumours with high accuracy. We identified a large number of novel genes and isoforms associated with HG. Grade 2 tumours could be reclassified as high and low TG, which has the potential to reduce over- and under-treatment if implemented clinically.

    Integration of transcriptomics and metabonomics: improving diagnostics, biomarker identification and phenotyping in ulcerative colitis

    A systems biology approach to multi-faceted diseases has provided an opportunity to establish a holistic understanding of the processes at play. Thus, the current study merges transcriptomics and metabonomics data in order to improve diagnostics and biomarker identification and to explore the possibilities of molecular phenotyping of ulcerative colitis (UC) patients. Biopsies were obtained from the descending colon of 43 UC patients (22 active UC and 21 quiescent UC) and 15 controls. Genome-wide gene expression analyses were performed using Affymetrix GeneChip Human Genome U133 Plus 2.0. Metabolic profiles were generated using 1H nuclear magnetic resonance spectroscopy (Bruker 600 MHz, Bruker BioSpin, Rheinstetten, Germany). Data were analyzed with the use of orthogonal-projection to latent structure-discriminant analysis and a multivariate logistic regression model fitted by lasso. Prediction performance was evaluated using nested Monte Carlo cross-validation. The prediction performance of the merged data sets and that of relatively small (<20 variables) multivariate biomarker panels suggest that it is possible to discriminate between active UC, quiescent UC, and controls; between patients with or without steroid dependency, as well as between early or late disease onset. Consequently, this study demonstrates that the novel approach of integrating metabonomics and transcriptomics combines the best of both worlds, and provides us with clinically applicable candidate biomarker panels. These combined panels improve diagnostics and, more importantly, also the molecular phenotyping in UC and provide insight into the pathophysiological processes at play, making optimized and personalized medication a possibility. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s11306-013-0580-3) contains supplementary material, which is available to authorized users.

    Intra-tumor heterogeneity in breast cancer has limited impact on transcriptomic-based molecular profiling

    Background: Transcriptomic profiling of breast tumors provides opportunity for subtyping and molecular-based patient stratification. In diagnostic applications the specimen profiled should be representative of the expression profile of the whole tumor and ideally capture properties of the most aggressive part of the tumor. However, breast cancers commonly exhibit intra-tumor heterogeneity at the molecular, genomic and phenotypic levels, which can arise during tumor evolution. Currently it is not established to what extent a random sampling approach may influence molecular breast cancer diagnostics. Methods: In this study we applied RNA-sequencing to quantify gene expression in 43 pieces (2-5 pieces per tumor) from 12 breast tumors (Cohort 1). We determined molecular subtype and transcriptomic grade for all tumor pieces and analysed to what extent pieces originating from the same tumors are concordant or discordant with each other. Additionally, we validated our finding in an independent cohort consisting of 19 pieces (2-6 pieces per tumor) from 6 breast tumors (Cohort 2) profiled using a microarray technique. Exome sequencing was also performed on this cohort, to investigate the extent of intra-tumor genomic heterogeneity versus the intra-tumor molecular subtype classifications. Results: Molecular subtyping was consistent in 11 out of 12 tumors and transcriptomic grade assignments were consistent in 11 out of 12 tumors as well. Molecular subtype predictions revealed consistent subtypes in four out of six patients in Cohort 2. Interestingly, we observed extensive intra-tumor genomic heterogeneity in these tumor pieces but not in their molecular subtype classifications. Conclusions: Our results suggest that macroscopic intra-tumoral transcriptomic heterogeneity is limited and unlikely to have an impact on molecular diagnostics for most patients.

    Variance decomposition of protein profiles from antibody arrays using a longitudinal twin model

    Background: The advent of affinity-based proteomics technologies for global protein profiling provides the prospect of finding new molecular biomarkers for common, multifactorial disorders. The molecular phenotypes obtained from studies on such platforms are driven by multiple sources, including genetic, environmental, and experimental components. In characterizing the contribution of different sources of variation to the measured phenotypes, the aim is to facilitate the design and interpretation of future biomedical studies employing exploratory and multiplexed technologies. Thus, biometrical genetic modelling of twin or other family data can be used to decompose the variation underlying a phenotype into biological and experimental components. Results: Using antibody suspension bead arrays and antibodies from the Human Protein Atlas, we study unfractionated serum from a longitudinal study on 154 twins. In this study, we provide a detailed description of how the variation in a molecular phenotype in terms of protein profile can be decomposed into familial (i.e. genetic and common environmental), individual environmental, short-term biological and experimental components. The results show that across 69 antibodies analyzed in the study, the median proportion of the total variation explained by familial sources is 12% (IQR 1-22%), and the median proportion of the total variation attributable to experimental sources is 63% (IQR 53-72%). Conclusion: The variability analysis of antibody arrays highlights the importance of considering variability components and their relative contributions when designing and evaluating studies for biomarker discovery with exploratory, high-throughput and multiplexed methods.

    Sequencing-based breast cancer diagnostics as an alternative to routine biomarkers

    Sequencing-based breast cancer diagnostics have the potential to replace routine biomarkers and provide molecular characterization that enables personalized precision medicine. Here we investigate the concordance between sequencing-based and routine diagnostic biomarkers and to what extent tumor sequencing contributes clinically actionable information. We applied DNA- and RNA-sequencing to characterize tumors from 307 breast cancer patients with replication in up to 739 patients. We developed models to predict status of routine biomarkers (ER, HER2, Ki-67, histological grade) from sequencing data. Non-routine biomarkers, including mutations in BRCA1, BRCA2 and ERBB2 (HER2), and additional clinically actionable somatic alterations were also investigated. Concordance with routine diagnostic biomarkers was high for ER status (AUC = 0.95; AUC(replication) = 0.97) and HER2 status (AUC = 0.97; AUC(replication) = 0.92). The transcriptomic grade model enabled classification of histological grade 1 and histological grade 3 tumors with high accuracy (AUC = 0.98; AUC(replication) = 0.94). Clinically actionable mutations in BRCA1, BRCA2 and ERBB2 (HER2) were detected in 5.5% of patients, while 53% had genomic alterations matching ongoing or concluded breast cancer studies. Sequencing-based molecular profiling can be applied as an alternative to histopathology to determine ER and HER2 status, in addition to providing improved tumor grading and clinically actionable mutations and molecular subtypes. Our results suggest that sequencing-based breast cancer diagnostics can replace routine biomarkers in the near future.